softmax function
Deep Neural Nets with Interpolating Function as Output Activation
We replace the output layer of deep neural nets, typically the softmax function, with a novel interpolating function, and we propose end-to-end training and testing algorithms for this new architecture. Compared to classical neural nets with the softmax function as output activation, the surrogate with an interpolating function as output activation combines the advantages of both deep learning and manifold learning. The new framework offers two major advantages: first, it handles the case of insufficient training data better; second, it significantly improves generalization accuracy on a wide variety of networks. The algorithm is implemented in PyTorch, and the code is available at https://github.com/
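As a toy illustration of the contrast described above, the sketch below places a standard softmax output next to a simple distance-weighted label interpolation over training features. The inverse-squared-distance weighting here is a hypothetical stand-in chosen for brevity; the paper's actual interpolating function is more involved.

```python
import math

def softmax(logits):
    # Standard softmax output activation: raw scores -> class probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def interpolate_labels(feature, train_feats, train_labels, num_classes, eps=1e-12):
    # Toy interpolating output: a distance-weighted average of the one-hot
    # labels of training points in feature space. Illustrative only; the
    # weighting scheme is an assumption, not the paper's method.
    weights = []
    for f in train_feats:
        d2 = sum((a - b) ** 2 for a, b in zip(feature, f))
        weights.append(1.0 / (d2 + eps))
    total = sum(weights)
    probs = [0.0] * num_classes
    for w, y in zip(weights, train_labels):
        probs[y] += w / total
    return probs
```

Both functions return a probability vector over classes, but the interpolating output is tied directly to the training data in feature space, which is what lets the surrogate behave more like a manifold-learning method.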
Generalizable Multi-Linear Attention Network
The majority of existing multimodal sequential learning methods focus on how to obtain powerful individual representations and neglect to effectively capture the multimodal joint representation. The bilinear attention network (BAN) is a commonly used integration method, which leverages tensor operations to associate the features of different modalities.
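To make the idea of bilinear fusion concrete, here is a minimal sketch of attention-weighted bilinear pooling between two modalities. The shapes, the low-rank projections `U` and `V`, and the global normalization are simplifying assumptions for illustration, not the exact BAN formulation.

```python
import numpy as np

def bilinear_attention(X, Y, U, V):
    """Sketch of bilinear attention fusion (illustrative, not exact BAN).
    X: (n, dx) features of modality 1; Y: (m, dy) features of modality 2;
    U: (dx, k), V: (dy, k) low-rank projections into a shared space."""
    # Bilinear attention logits between every pair of positions.
    logits = (X @ U) @ (Y @ V).T            # shape (n, m)
    # Normalize into an attention distribution over all position pairs.
    A = np.exp(logits - logits.max())
    A = A / A.sum()
    # Joint representation: attention-weighted bilinear pooling.
    joint = (X @ U).T @ A @ (Y @ V)         # shape (k, k)
    return A, joint.ravel()
```

The tensor operations (the two projections plus the bilinear form) are what tie the modalities together into a single joint representation rather than two separate ones.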
As stated in Section A, we apply the softmax function so that RAP$^{\text{softmax}}$ outputs a synthetic dataset drawn from the probabilistic family of distributions $\mathcal{D} = \{\sigma(M) \mid M \in \mathbb{R}^{n}\}$. The minimizer of this loss is the distribution
$$D_t(x) \propto \exp\Big(\sum_{i=1}^{t} \tilde{q}_i(x)\big(\tilde{a}_i - \tilde{q}_i(D_{i-1})\big)\Big),$$
which is exactly the distribution computed by MWEM, together with the entropy term $\sum_x D(x)\log(D(x))$ (6). The optimization problem becomes $D_t = \operatorname{argmin}_{D \in \mathcal{D}(\mathcal{X})} L_{\mathrm{mwem}}(D, \tilde{Q}_t, \tilde{A}_t)$. We show the exact details of GEM in Algorithms 2 and 3. Note that given a vector of queries $Q_t = \langle q_1, \ldots, q_t \rangle$, we define $f_{Q_t}(\cdot) = \langle f_{q_1}(\cdot), \ldots, f_{q_t}(\cdot) \rangle$.

B.1 Loss function (for k-way marginals) and distributional family

For any $z \in \mathbb{R}$, $G(z)$ outputs a distribution over each attribute, which we can use to calculate the answer to a query via $f_q$. Empirically, we find that our model tends to better capture the distribution of the overall private dataset in this way (Figure 3).
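The two ingredients above, a softmax-parameterized family of distributions and per-attribute query answering, can be sketched as follows. The product-form answer for a marginal query is an assumption made here for illustration, consistent with the description of per-attribute distributions and $f_q$.

```python
import math

def row_softmax(M):
    # sigma(M): map each row of real-valued scores to a probability
    # distribution over one attribute's values.
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def marginal_query_answer(dist, query):
    # Answer a marginal query under an assumed product-form distribution:
    # the probability that each queried attribute takes its specified
    # value. `query` maps attribute index -> value index.
    ans = 1.0
    for attr, val in query.items():
        ans *= dist[attr][val]
    return ans
```

Because every row of `row_softmax` is a valid probability distribution by construction, any real-valued parameter matrix yields a member of the distributional family, which is what makes the family convenient to optimize over.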
Adaptive Sampling for Efficient Softmax Approximation
The softmax function is ubiquitous in machine learning and optimization applications. Computing the full softmax of a matrix-vector product can be computationally expensive in high-dimensional settings. In many applications, however, it is sufficient to calculate only the top few outputs of the softmax function. In this work, we present an algorithm, dubbed AdaptiveSoftmax, that adaptively computes the top k softmax values more efficiently than the full softmax computation, with probabilistic guarantees. We demonstrate the sample efficiency improvements afforded by AdaptiveSoftmax on real and synthetic data to corroborate our theoretical results.
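A toy sketch of the general idea, cheaply screening rows with sampled coordinates before scoring a small candidate set exactly, is shown below. This is an illustration of sampling-based top-k estimation under simplifying assumptions, not the paper's AdaptiveSoftmax algorithm, and it normalizes only over the candidate set rather than all rows.

```python
import math
import random

def sampled_topk_softmax(A, x, k, sample_size=4, seed=0):
    # Toy screening scheme (not the paper's algorithm): estimate each
    # logit a_i . x from a random subset of coordinates, keep the 2k most
    # promising rows, then compute exact logits only for those.
    rng = random.Random(seed)
    d = len(x)
    idx = rng.sample(range(d), min(sample_size, d))
    # Coarse logit estimates from sampled coordinates, rescaled.
    est = [sum(row[j] * x[j] for j in idx) * d / len(idx) for row in A]
    keep = sorted(range(len(A)), key=lambda i: est[i], reverse=True)[: 2 * k]
    exact = {i: sum(A[i][j] * x[j] for j in range(d)) for i in keep}
    top = sorted(exact, key=exact.get, reverse=True)[:k]
    # Softmax normalized over the exactly-scored candidates only
    # (an approximation of the true softmax denominator).
    m = max(exact.values())
    z = sum(math.exp(v - m) for v in exact.values())
    return {i: math.exp(exact[i] - m) / z for i in top}
```

The cost saving comes from touching only `sample_size` coordinates per row in the screening pass; the full d-dimensional dot products are computed for just 2k candidates instead of all n rows.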